We introduce the following notation:
Computation at layer $l$ of the network: $\zv_l = \Wv_l \av_{l-1} + \bv_l$, $\av_l = h_l (\zv_l)$
The full network: $\xv = \av_0 \xrightarrow{\Wv_1,\bv_1} \zv_1 \xrightarrow{h_1} \av_1 \xrightarrow{\Wv_2,\bv_2} \cdots \xrightarrow{\Wv_L,\bv_L} \zv_L \xrightarrow{h_L} \av_L = \hat{\yv}$
The earliest M-P (McCulloch–Pitts) model used the step function $\sgn(\cdot)$ as its activation function
Directions for improvement:
Common choices include
Squashes $\Rbb$ into $[0,1]$, so the output can be interpreted as a probability:
$$ \begin{align*} \qquad \sigma(z) = \frac{1}{1 + \exp (-z)} = \begin{cases} 1, & z \rightarrow \infty \\ 0, & z \rightarrow -\infty \end{cases} \end{align*} $$
The logistic function is continuously differentiable, and its derivative is largest at zero:
$$ \begin{align*} \qquad \nabla \sigma(z) = \sigma(z) (1 - \sigma(z)) \le \left( \frac{\sigma(z) + 1 - \sigma(z)}{2} \right)^2 = \frac{1}{4} \end{align*} $$
Equality in the AM–GM inequality holds when $\sigma(z) = 1 - \sigma(z)$, i.e., at $z = 0$
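As a quick numerical sanity check (a small NumPy sketch, not part of the derivation), we can verify that $\sigma'(z) = \sigma(z)(1-\sigma(z))$ never exceeds $1/4$ and attains this maximum at $z = 0$:

```python
import numpy as np

def sigmoid(z):
    # logistic function: squashes R into [0, 1]
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # derivative sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.linspace(-10, 10, 1001)
g = sigmoid_grad(z)
print(g.max())        # -> 0.25, the bound from the AM-GM argument
print(z[g.argmax()])  # -> 0.0, where equality holds
```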
Squashes $\Rbb$ into $[-1,1]$; the output is zero-centered; it is a scaled and shifted logistic function
$$ \begin{align*} \qquad \tanh(z) & = \frac{\exp(z) - \exp(-z)}{\exp(z) + \exp(-z)} = \frac{1 - \exp(-2z)}{1 + \exp(-2z)} = 2 \sigma(2z) - 1 \\[2pt] & = \begin{cases} 1, & z \rightarrow \infty \\ -1, & z \rightarrow -\infty \end{cases} \\[10pt] \nabla \tanh(z) & = 4 \sigma(2z) (1 - \sigma(2z)) \le 1 \end{align*} $$
The hyperbolic tangent is continuously differentiable, and its derivative is largest at $z = 0$
Zero-centered outputs keep the inputs to all non-input layers near zero, where the derivative of the hyperbolic tangent is largest, so gradient descent updates are more efficient; the logistic function's output is always positive, which slows the convergence of gradient descent
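The identity $\tanh(z) = 2\sigma(2z) - 1$ and the derivative bound can also be checked numerically (a small NumPy sketch):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 101)
# tanh is a scaled and shifted logistic function
assert np.allclose(np.tanh(z), 2 * sigmoid(2 * z) - 1)
# its derivative 4*sigma(2z)*(1 - sigma(2z)) is bounded by 1, attained at z = 0
grad = 4 * sigmoid(2 * z) * (1 - sigmoid(2 * z))
print(grad.max())  # -> 1.0
```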
Rectified linear unit (ReLU):
$$ \begin{align*} \qquad \relu(z) = \max \{ 0, z \} = \begin{cases} z & z \ge 0 \\ 0 & z < 0 \end{cases} \end{align*} $$
Advantages
Disadvantages
By the chain rule,
$$ \begin{align*} \qquad \nabla_{\wv} \relu(\wv^\top \xv + b) & = \frac{\partial \relu(\wv^\top \xv + b)}{\partial (\wv^\top \xv + b)} \frac{\partial (\wv^\top \xv + b)}{\partial \wv} \\ & = \frac{\partial \max \{ 0, \wv^\top \xv + b \}}{\partial (\wv^\top \xv + b)} \xv \\ & = \Ibb(\wv^\top \xv + b \ge 0) \xv \end{align*} $$
If the $(\wv, b)$ of some neuron in the first hidden layer is badly initialized, so that $\wv^\top \xv + b < 0$ for every $\xv$, then its gradient with respect to $(\wv, b)$ is zero and the neuron will never be updated during subsequent training
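This "dead" behavior can be illustrated directly (a minimal NumPy sketch; the data and the deliberately bad initialization are made up for illustration):

```python
import numpy as np

def relu_grad_w(w, b, X):
    # gradient of relu(w^T x + b) w.r.t. w for each row x of X:
    # indicator(w^T x + b >= 0) * x, as derived by the chain rule
    pre = X @ w + b
    return (pre >= 0).astype(float)[:, None] * X

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))     # 100 inputs in R^3
w_bad, b_bad = np.zeros(3), -100.0    # w^T x + b < 0 for every x
g = relu_grad_w(w_bad, b_bad, X)
print(np.abs(g).max())                # -> 0.0: the unit is dead, never updated
```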
Remedies: leaky ReLU, parametric ReLU, ELU, Softplus
Leaky ReLU: a nonzero gradient even when $\wv^\top \xv + b < 0$
$$ \begin{align*} \qquad \lrelu(z) & = \begin{cases} z & z \ge 0 \\ \gamma z & z < 0 \end{cases} \\ & = \max \{ 0, z \} + \gamma \min \{ 0, z \} \overset{\gamma < 1}{=} \max \{ z, \gamma z \} \end{align*} $$
where the slope $\gamma$ is a small constant, e.g., $0.01$
Parametric ReLU: the slope $\gamma_i$ is learnable
$$ \begin{align*} \qquad \prelu(z) & = \begin{cases} z & z \ge 0 \\ \gamma_i z & z < 0 \end{cases} \\[4pt] & = \max \{ 0, z \} + \gamma_i \min \{ 0, z \} \end{align*} $$
Each neuron may have its own parameter, or a group of neurons may share one
Exponential linear unit (ELU)
$$ \begin{align*} \qquad \elu(z) & = \begin{cases} z & z \ge 0 \\ \gamma (\exp(z) - 1) & z < 0 \end{cases} \\[4pt] & = \max \{ 0, z \} + \min \{ 0, \gamma (\exp(z) - 1) \} \end{align*} $$
The Softplus function can be viewed as a smooth version of ReLU:
$$ \begin{align*} \qquad \softplus(z) = \ln (1 + \exp(z)) \end{align*} $$
Its derivative is the logistic function
$$ \begin{align*} \qquad \nabla \softplus(z) = \frac{\exp(z)}{1 + \exp(z)} = \frac{1}{1 + \exp(-z)} \end{align*} $$
The Swish function is a self-gated (self-gated) activation function:
$$ \begin{align*} \qquad \swish(z) = z \cdot \sigma (\beta z) = \frac{z}{1 + \exp(-\beta z)} \end{align*} $$
where $\beta$ is either a learnable parameter or a fixed hyperparameter
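These ReLU variants are straightforward to implement; here is a sketch (NumPy, with $\gamma$ and $\beta$ fixed to illustrative values), which also checks numerically that the derivative of Softplus is the logistic function:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def leaky_relu(z, gamma=0.01):
    # max{0, z} + gamma * min{0, z}
    return np.maximum(0, z) + gamma * np.minimum(0, z)

def elu(z, gamma=1.0):
    # max{0, z} + min{0, gamma * (exp(z) - 1)}
    return np.maximum(0, z) + np.minimum(0, gamma * (np.exp(z) - 1))

def softplus(z):
    # smooth version of ReLU
    return np.log1p(np.exp(z))

def swish(z, beta=1.0):
    # self-gated: z * sigma(beta * z)
    return z * sigmoid(beta * z)

z = np.linspace(-4, 4, 801)
# numerical derivative of softplus matches the logistic function
num_grad = np.gradient(softplus(z), z)
print(np.abs(num_grad - sigmoid(z)).max() < 1e-3)  # -> True
```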
Consider layer $l$ of the network:
$$ \begin{align*} \qquad \zv_l & = \Wv_l \av_{l-1} + \bv_l \\ \av_l & = h_l (\zv_l) \end{align*} $$
The activation functions above are all $\Rbb \mapsto \Rbb$, applied elementwise: $[\av_l]_i = h_l ([\zv_l]_i), ~ i \in [n_l]$
The maxout unit, in contrast, is $\Rbb^{n_l} \mapsto \Rbb$: its input is the whole vector $\zv_l$, and it is defined as
$$ \begin{align*} \qquad \maxout (\zv) = \max_{k \in [K]} \{ \wv_k^\top \zv + b_k \} \end{align*} $$
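A maxout unit with $K$ linear pieces can be sketched in NumPy (the weights here are random placeholders for illustration):

```python
import numpy as np

def maxout(z, W, b):
    # z in R^{n_l}; W stacks the K rows w_k, b holds the K biases b_k
    # returns max_k { w_k^T z + b_k }
    return np.max(W @ z + b)

rng = np.random.default_rng(0)
K, n = 4, 3
W = rng.standard_normal((K, n))
b = rng.standard_normal(K)
z = rng.standard_normal(n)
print(maxout(z, W, b))

# with K = 2, rows (1,) and (gamma,), and zero bias, maxout reduces to
# leaky ReLU: max{z, gamma * z} for gamma < 1
W2, b2 = np.array([[1.0], [0.01]]), np.zeros(2)
print(maxout(np.array([-2.0]), W2, b2))  # -> -0.02
```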
The first $L-1$ layers form a composite function $\psi: \Rbb^d \mapsto \Rbb^{n_{L-1}}$, which can be viewed as a feature transformation
The last layer is a learner $\hat{\yv} = g(\psi(\xv); \Wv_L, \bv_L)$ that makes predictions from the input $\psi(\xv)$
Logistic regression can also be viewed as a neural network with only one layer (no hidden layers)
Traditional machine learning: feature engineering and model learning are carried out as two separate stages
Deep learning: feature engineering and model learning are merged into one, end-to-end
The optimization objective of the neural network is
$$ \begin{align*} \qquad \min_{\Wv, \bv} ~ \frac{1}{m} \sum_{i \in [m]} \ell (\yv_i, \hat{\yv}_i) \end{align*} $$
where computing the loss $\ell (\yv, \hat{\yv})$ constitutes the forward pass
The gradient descent update rule is
$$ \begin{align*} \qquad \Wv ~ \leftarrow ~ \Wv - \frac{\eta}{m} \sum_{i \in [m]} \class{yellow}{\frac{\partial \ell (\yv_i, \hat{\yv}_i)}{\partial \Wv}}, \quad \bv ~ \leftarrow ~ \bv - \frac{\eta}{m} \sum_{i \in [m]} \class{yellow}{\frac{\partial \ell (\yv_i, \hat{\yv}_i)}{\partial \bv}} \end{align*} $$
For the last layer, $\zv_L = \Wv_L ~ \av_{L-1} + \bv_L$, $\av_L = h_L (\zv_L)$; by the chain rule,
$$ \begin{align*} \qquad \frac{\partial \ell (\yv, \hat{\yv})}{\partial \bv_L} & = \frac{\partial \ell (\yv, \hat{\yv})}{\partial \zv_L} \frac{\partial \zv_L}{\partial \bv_L} = \deltav_L^\top \frac{\partial \zv_L}{\partial \bv_L} = \deltav_L^\top \\ \frac{\partial \ell (\yv, \hat{\yv})}{\partial \Wv_L} & = \sum_{j \in [n_L]} \frac{\partial \ell (\yv, \hat{\yv})}{\partial [\zv_L]_j} \frac{\partial [\zv_L]_j}{\partial \Wv_L} = \sum_{j \in [n_L]} [\deltav_L]_j \frac{\partial [\zv_L]_j}{\partial \Wv_L} \end{align*} $$
where $\deltav_L^\top = \partial \ell (\yv, \hat{\yv}) / \partial \zv_L \in \Rbb^{n_L}$ is the error term of layer $L$, which can be computed directly
Similarly, for layer $l$ with $\zv_l = \Wv_l \av_{l-1} + \bv_l$, $\av_l = h_l (\zv_l)$, the chain rule gives
$$ \begin{align*} \qquad \frac{\partial \ell (\yv, \hat{\yv})}{\partial \bv_l} = \deltav_l^\top, \quad \frac{\partial \ell (\yv, \hat{\yv})}{\partial \Wv_l} = \sum_{j \in [n_l]} [\deltav_l]_j \frac{\partial [\zv_l]_j}{\partial \Wv_l} \end{align*} $$
where $\deltav_l^\top = \partial \ell (\yv, \hat{\yv}) / \partial \zv_l \in \Rbb^{n_l}$ is the error term of layer $l$
Backpropagation (BP): each layer's error term is obtained from the layer after it
$$ \begin{align*} \qquad \deltav_{l-1}^\top = \frac{\partial \ell (\yv, \hat{\yv})}{\partial \zv_{l-1}} = \frac{\partial \ell (\yv, \hat{\yv})}{\partial \zv_l} \frac{\partial \zv_l}{\partial \av_{l-1}} \frac{\partial \av_{l-1}}{\partial \zv_{l-1}} = \deltav_l^\top \Wv_l \frac{\partial h_{l-1}(\zv_{l-1})}{\partial \zv_{l-1}} \end{align*} $$
Finally, for layer $l$ with $\zv_l = \Wv_l \av_{l-1} + \bv_l$, how do we compute $\partial [\zv_l]_j / \partial \Wv_l$?
Note that $z_j = \sum_k w_{jk} a_k + b_j$ depends only on the $j$-th row of $\Wv$, so
$$ \begin{align*} \qquad & \frac{\partial z_j}{\partial \Wv} = \underbrace{\begin{bmatrix} \zerov, \ldots, \av, \ldots, \zerov \end{bmatrix}}_{\text{only }\av\text{ at }j\text{-th column}} = \av \ev_j^\top \\[4pt] \qquad & \Longrightarrow \frac{\partial \ell (\yv, \hat{\yv})}{\partial \Wv_l} = \sum_{j \in [n_l]} [\deltav_l]_j \frac{\partial [\zv_l]_j}{\partial \Wv_l} = \av_{l-1} \sum_{j \in [n_l]} [\deltav_l]_j \ev_j^\top = \av_{l-1} \deltav_l^\top \end{align*} $$
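The forward and backward passes above can be sketched end-to-end in NumPy. This is a minimal two-layer illustration (the squared-error loss and sigmoid activations are assumptions chosen for the example), with the analytic gradient of the last bias checked against a finite difference:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, x):
    # forward pass: z_l = W_l a_{l-1} + b_l, a_l = h(z_l)
    W1, b1, W2, b2 = params
    z1 = W1 @ x + b1
    a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2
    a2 = sigmoid(z2)
    return z1, a1, z2, a2

def loss(params, x, y):
    # squared-error loss (an assumption for this example)
    a2 = forward(params, x)[-1]
    return 0.5 * np.sum((a2 - y) ** 2)

def backward(params, x, y):
    W1, b1, W2, b2 = params
    z1, a1, z2, a2 = forward(params, x)
    # error term of the last layer: delta_L = dloss/dz_L
    delta2 = (a2 - y) * a2 * (1 - a2)
    # backpropagation: delta_{l-1} = (W_l^T delta_l) * h'(z_{l-1})
    delta1 = (W2.T @ delta2) * a1 * (1 - a1)
    # np.outer(delta, a) is the transpose of a_{l-1} delta_l^T in the text,
    # laid out to match the shape of W_l
    return np.outer(delta1, x), delta1, np.outer(delta2, a1), delta2

rng = np.random.default_rng(0)
params = [rng.standard_normal((3, 2)), rng.standard_normal(3),
          rng.standard_normal((1, 3)), rng.standard_normal(1)]
x, y = rng.standard_normal(2), np.array([1.0])

grads = backward(params, x, y)
# check dloss/db2 against a forward finite difference
eps = 1e-6
p = [q.copy() for q in params]
p[3] = p[3] + eps
num = (loss(p, x, y) - loss(params, x, y)) / eps
print(abs(num - grads[3][0]) < 1e-5)  # -> True
```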
Input: training set, validation set, relevant hyperparameters
Output: $\Wv$ and $\bv$
```python
import numpy as np
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(
    hidden_layer_sizes=(h,),   # number of hidden-layer neurons (h defined elsewhere)
    activation='logistic',     # identity, logistic, tanh, relu
    max_iter=100,              # maximum number of iterations
    solver='lbfgs',            # solver
    alpha=0,                   # regularization coefficient
    batch_size=32,             # batch size
    learning_rate='constant',  # constant, invscaling, adaptive
    shuffle=True,              # whether to reshuffle samples each epoch
    momentum=0.9,              # momentum coefficient, sgd only
    nesterovs_momentum=True,   # use Nesterov acceleration with momentum
    early_stopping=False,      # whether to stop early
    warm_start=False,          # whether to enable warm starting
    random_state=1,
    verbose=False,
    # ...
)
clf = mlp.fit(X, y)
acc = clf.score(X, y)
```
```python
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam

model = Sequential()
model.add(Dense(units=3, activation='sigmoid', input_shape=(2,)))
model.add(Dense(units=1, activation='sigmoid'))
model.summary()  # print the model
# _________________________________________________________________
#  Layer (type)                Output Shape              Param #
# =================================================================
#  dense (Dense)               (None, 3)                 9
#  dense_1 (Dense)             (None, 1)                 4
# =================================================================
# Total params: 13
# Trainable params: 13
# Non-trainable params: 0
# _________________________________________________________________

model.compile(
    optimizer=Adam(0.1),
    loss='binary_crossentropy',
    metrics=['accuracy'],
)
model.fit(X, y, epochs=10, batch_size=32)
# Epoch 1/10
# 32/32 [==============] - 0s 1ms/step - loss: 0.6481 - accuracy: 0.6309
# Epoch 2/10
# 32/32 [==============] - 0s 1ms/step - loss: 0.5064 - accuracy: 0.7500
# Epoch 3/10
# 32/32 [==============] - 0s 1000us/step - loss: 0.3309 - accuracy: 0.8369
# Epoch 4/10
# 32/32 [==============] - 0s 1ms/step - loss: 0.1383 - accuracy: 1.0000
# Epoch 5/10
# 32/32 [==============] - 0s 1ms/step - loss: 0.0643 - accuracy: 1.0000
# Epoch 6/10
# 32/32 [==============] - 0s 1ms/step - loss: 0.0395 - accuracy: 1.0000
# Epoch 7/10
# 32/32 [==============] - 0s 1ms/step - loss: 0.0276 - accuracy: 1.0000
# Epoch 8/10
# 32/32 [==============] - 0s 1ms/step - loss: 0.0208 - accuracy: 1.0000
# Epoch 9/10
# 32/32 [==============] - 0s 994us/step - loss: 0.0165 - accuracy: 1.0000
# Epoch 10/10
# 32/32 [==============] - 0s 997us/step - loss: 0.0134 - accuracy: 1.0000

loss, acc = model.evaluate(X, y, verbose=2)
# 32/32 - 0s - loss: 0.0121 - accuracy: 1.0000 - 93ms/epoch - 3ms/step
```
The recursion for backpropagating the error through the network is
$$ \begin{align*} \qquad \deltav_l^\top = \frac{\partial \ell (\yv, \hat{\yv})}{\partial \zv_l} = \frac{\partial \ell (\yv, \hat{\yv})}{\partial \zv_{l+1}} \frac{\partial \zv_{l+1}}{\partial \av_l} \frac{\partial \av_l}{\partial \zv_l} = \deltav_{l+1}^\top \Wv_{l+1} \diag (h_l'(\zv_l)) \end{align*} $$
For Sigmoid-type activation functions, the error is multiplied by a factor of at most $1$ each time it propagates back one layer; when the network is deep, the gradient keeps decaying and may vanish, making the whole network very hard to train
Remedy: use activation functions with larger derivatives, such as ReLU
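The decay can be made concrete: with the logistic activation the backpropagated error picks up a factor $h'(z) \le 1/4$ per layer, so over many layers it shrinks geometrically (a toy NumPy sketch with random pre-activations standing in for a real network):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
scale = 1.0
for layer in range(20):
    z = rng.standard_normal()
    s = sigmoid(z)
    scale *= s * (1 - s)  # multiply by h'(z) <= 1/4 at every layer
print(scale)              # astronomically small after 20 layers
```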
Residual module: $\zv_l = \av_{l-1} + \class{yellow}{\Uv_2 \cdot h(\Uv_1 \cdot \av_{l-1} + \cv_1) + \cv_2} = \av_{l-1} + \class{yellow}{f(\av_{l-1})}$
Assume $\av_l = \zv_l$, i.e., no activation is applied to the residual module's output; then for any $t \in [l]$,
$$ \begin{align*} \quad \av_l = \av_{l-1} + f(\av_{l-1}) = \av_{l-2} + f(\av_{l-2}) + f(\av_{l-1}) = \cdots = \av_{l-t} + \sum_{i=l-t}^{l-1} f(\av_i) \end{align*} $$
Low-layer inputs can be propagated identically to any higher layer
$$ \begin{align*} \quad \av_l = \av_{l-t} + \sum_{i=l-t}^{l-1} f(\av_i) \end{align*} $$
By the chain rule,
$$ \begin{align*} \quad \frac{\partial \ell}{\partial \av_{l-t}} & = \frac{\partial \ell}{\partial \av_l} \frac{\partial \av_l}{\partial \av_{l-t}} = \frac{\partial \ell}{\partial \av_l} \left( \frac{\partial \av_{l-t}}{\partial \av_{l-t}} + \frac{\partial }{\partial \av_{l-t}} \sum_{i=l-t}^{l-1} f(\av_i) \right) \\ & = \frac{\partial \ell}{\partial \av_l} \left( \Iv + \frac{\partial }{\partial \av_{l-t}} \sum_{i=l-t}^{l-1} f(\av_i) \right) \\ & = \frac{\partial \ell}{\partial \av_l} + \frac{\partial \ell}{\partial \av_l} \left( \frac{\partial }{\partial \av_{l-t}} \sum_{i=l-t}^{l-1} f(\av_i) \right) \end{align*} $$
High-layer errors can be propagated identically to any lower layer, which mitigates gradient vanishing
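The telescoping identity $\av_l = \av_{l-t} + \sum_{i=l-t}^{l-1} f(\av_i)$ can be verified directly for a stack of residual modules (a NumPy sketch; a single tanh layer stands in for the two-layer residual branch $f$ in the text):

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 4, 6
U = [rng.standard_normal((n, n)) * 0.1 for _ in range(depth)]

def f(a, i):
    # residual branch; tanh stands in for the two-layer f in the text
    return np.tanh(U[i] @ a)

a = [rng.standard_normal(n)]     # a_0
for i in range(depth):
    a.append(a[i] + f(a[i], i))  # a_l = a_{l-1} + f(a_{l-1}), no output activation

# the low-layer input a_0 propagates identically to the top layer:
total = a[0] + sum(f(a[i], i) for i in range(depth))
print(np.allclose(a[depth], total))  # -> True
```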
Neural networks have been extended to many types of data
What it takes to be a good "alchemist" (a playful nickname for deep-learning practitioners):